Abstract

Recent progress in using recurrent neural networks (RNNs) for image
description has motivated the exploration of their application to
video description. However, while images are static, working with
videos requires modeling their dynamic temporal structure and then
properly integrating that information into a natural language
description. In this context, we propose an approach that successfully
takes into account both the local and global temporal structure of
videos to produce descriptions. First, our approach incorporates a
spatio-temporal 3-D convolutional neural network (3-D CNN)
representation of the short temporal dynamics. The 3-D CNN
representation is trained on video action recognition tasks, so as to
produce a representation that is tuned to human motion and behavior.
Second, we propose a temporal attention mechanism that goes beyond
local temporal modeling and learns to automatically select the most
relevant temporal segments given the text-generating RNN. Our approach
exceeds the current state of the art for both BLEU and METEOR metrics
on the Youtube2Text dataset. We also present results on a new, larger
and more challenging dataset of paired video and natural language
descriptions.


1. Introduction 

The task of automatically describing videos containing rich and
open-domain activities poses an important challenge for computer
vision and machine learning research. It also has a variety of
practical applications. For example, every minute, 100 hours of video
are uploaded to YouTube.^1 However, if a video is poorly tagged, its
utility is dramatically diminished [24]. Automatic video description
generation has the potential to help improve indexing and search
quality for online videos. In conjunction with speech synthesis
technology, annotating video with natural language descriptions also
has the potential to benefit the visually impaired.

Figure 1. High-level visualization of our approach to video
description generation. We incorporate models of both the local
temporal dynamics (i.e. within blocks of a few frames) of videos, as
well as their global temporal structure. The local structure is
modeled using the temporal feature maps of a 3-D CNN, while a temporal
attention mechanism is used to combine information across the entire
video. For each generated word, the model can focus on different
temporal regions in the video. For simplicity, we highlight only the
region having the maximum attention above. [Figure text: "A man is
shooting a gun"]

While image description generation is already considered a very
challenging task, the automatic generation of video description
carries additional difficulties. Simply dealing with the sheer
quantity of information contained in video data is one such challenge.
Moreover, video description involves generating a sentence to
characterize a video clip lasting typically 5 to 10 seconds, or 120 to
240 frames. Often such clips contain complex interactions of actors
and objects that evolve over time. Altogether this amounts to a vast
quantity of information, and attempting to represent this information
using a single, temporally collapsed feature representation is likely
to be prone to clutter, with temporally distinct events and objects
being potentially fused incoherently. It is therefore important that
an automatic video description generator exploit the temporal
structure underlying video.

^1 https://www.youtube.com/yt/press/statistics.html, accessed on
2015-02-06.

We argue that there are two categories of temporal structure present
in video: (1) local structure and (2) global structure. Local temporal
structure refers to the fine-grained motion information that
characterizes punctuated actions such as "answering the telephone" or
"standing up". Actions such as these are relatively localized in time,
evolving over only a few consecutive frames. On the other hand, when
we refer to global temporal structure in video, we refer to the
sequence in which objects, actions, scenes, people, etc. appear in a
video. Video description may well be termed video summarization,
because we typically look for a single sentence to summarize what can
be a rather elaborate sequence of events. Just as good image
descriptions often focus on the more salient parts of the image, we
argue that good video description systems should selectively focus on
the most salient features of a video sequence.

Recently, Venugopalan et al. [41] used a so-called encoder-decoder
neural network framework [9] to automatically generate the description
of a video clip. They extracted appearance features from each frame of
an input video clip using a previously trained convolutional neural
network [22]. The features from all the frames, or subsampled frames,
were then collapsed via simple averaging to result in a single vector
representation of the entire video clip. Due to this indiscriminate
averaging of all the frames, this approach risks ignoring much of the
temporal structure underlying the video clip. For instance, it is not
possible to tell the order of the appearances of two objects from the
collapsed features.

In this paper, we introduce a temporal attention mechanism to exploit
global temporal structure. We also augment the appearance features
with action features that encode local temporal structure. Our action
features are derived from a spatio-temporal convolutional neural
network (3-D CNN) [?]. The temporal attention mechanism is based on a
recently proposed soft-alignment method [?] which was used
successfully in the context of machine translation. While generating a
description, the temporal attention mechanism selectively focuses on a
small subset of frames, making it possible for the generator to
describe only the objects and/or activities in that subset (see Fig. 1
for the graphical illustration). Our 3-D CNN, on the other hand,
starts from both temporally and spatially local motion descriptors of
video and hierarchically extracts more abstract action-related
features. These features preserve and emphasize important local
structure embedded in video for use by the description generator.

We evaluate the effectiveness of the proposed mechanisms for
exploiting temporal structure on the most widely used open-domain
video description dataset, called the Youtube2Text dataset [7], which
consists of 1,970 video clips with multiple descriptions per video. We
also test the proposed approaches on a much larger, and more recently
proposed, dataset based on the descriptive video service (DVS) tracks
in DVD movies [38], which contains 49,000 video clips.

Our work makes the following contributions: 1) We propose the use of a
novel 3-D CNN-RNN encoder-decoder architecture which captures local
spatio-temporal information. Despite the promising results obtained,
in both prior work and our own experiments, with static-frame CNN-RNN
video description methods, our experiments suggest that it is indeed
important to exploit local temporal structure when generating a
description of video. 2) We propose the use of an attention mechanism
within a CNN-RNN encoder-decoder framework for video description, and
we demonstrate through our experiments that it allows features
obtained through the global analysis of static frames throughout the
video to be used more effectively for video description generation.
3) We observe that the improvements brought by exploiting global and
local temporal information are complementary, with the best
performance achieved when both the temporal attention mechanism and
the 3-D CNN are used together.

2. Video Description Generation Using an 
Encoder-Decoder Framework 

In this section, we describe a general approach, based purely on
neural networks, to generate video descriptions. This approach is
based on the encoder-decoder framework [9], which has been
successfully used in machine translation [?] as well as image caption
generation [?].

2.1. Encoder-Decoder Framework 

The encoder-decoder framework consists of two neural networks: the
encoder and the decoder. The encoder network $\phi$ encodes the input
$x$ into a continuous-space representation which may be a
variable-sized set $V = \{v_1, \ldots, v_n\}$ of continuous vectors:

$$ V = \{v_1, \ldots, v_n\} = \phi(x). $$

The architecture choice for the encoder $\phi$ depends on the type of
input. For example, in the case of machine translation, it is natural
to use a recurrent neural network (RNN) for the encoder, since the
input is a variable-length sequence of symbols [33, 9]. With an image
as input, a convolutional neural network (CNN) is another good
alternative [44].

The decoder network generates the corresponding output $y$ from the
encoder representation $V$. As was the case with the encoder, the
decoder's architecture must be chosen according to the type of the
output. When the output is a natural language sentence, which is the
case in automatic video description, an RNN is a method of choice.

The decoder RNN runs sequentially over the output sequence. In brief,
to generate an output $y$, at each step $t$ the RNN updates its
internal state $h_t$ based on its previous internal state $h_{t-1}$ as
well as the previous output $y_{t-1}$ and the encoder representation
$V$, and then outputs a symbol $y_t$:

$$ \begin{bmatrix} y_t \\ h_t \end{bmatrix} = \psi(h_{t-1}, y_{t-1}, V), \tag{1} $$

where for now we simply denote by $\psi$ the function updating the
RNN's internal state and computing its output. The RNN is run
recursively until the end-of-sequence symbol is generated, i.e.,
$y_t = \langle\text{eos}\rangle$.
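
The recursion of Eq. (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_psi` is a hypothetical stand-in for the update function, the end-of-sequence index `EOS` is arbitrary, the same symbol is assumed to start the sequence, and a length cap is added so that the loop always terminates.

```python
import numpy as np

EOS = 0  # hypothetical index of the end-of-sequence symbol


def decode(psi, h0, V, max_len=20):
    """Run the decoder recursively: (y_t, h_t) = psi(h_{t-1}, y_{t-1}, V)."""
    h, y = h0, EOS  # assumption: EOS also serves as the start symbol
    output = []
    for _ in range(max_len):  # safety cap so that decoding always terminates
        y, h = psi(h, y, V)
        if y == EOS:
            break
        output.append(y)
    return output


def toy_psi(h, y_prev, V):
    """Toy stand-in for psi: counts steps in h and emits EOS after 3 symbols."""
    h = h + 1
    y = EOS if h > 3 else int(h)
    return y, h


symbols = decode(toy_psi, h0=0, V=np.zeros((2, 4)))
print(symbols)
```

In a real system `psi` would be the LSTM decoder of Sec. 2.3 and `V` the set of encoder features; the loop structure stays the same.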

In the remainder of this section, we detail choices for the encoder
and decoder for a basic automatic video description system, taken from
[41] and on which our work builds.

2.2. Encoder: Convolutional Neural Network 


Deep convolutional neural networks (CNNs) have recently been
successful at large-scale object recognition [22, 34]. Beyond the
object recognition task itself, CNNs trained for object recognition
have been found to be useful in a variety of other computer vision
tasks such as object localization and detection (see, e.g., [29]).
This has opened a door to a flood of computer vision systems that
exploit representations from upper or intermediate layers of a CNN as
generic high-level features for vision. For instance, the activation
of the last fully-connected layer can be used as a fixed-size vector
representation [20], or the feature map of the last convolutional
layer can be used as a set of spatial feature vectors [?].

In the case where the input is a video clip, an image-trained CNN can
be used for each frame separately, resulting in a single vector
representation $v_i$ of the $i$-th frame. This is the approach
proposed by [41], which used the convolutional neural network from
[22]. In our work here, we will also consider using the CNN from [34],
which has demonstrated higher performance for object recognition.
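
As an illustration of per-frame encoding, the following sketch builds the set V from a clip using the frame budget reported later in the experiments (first 240 frames, zero-padding of shorter clips, 26 equally spaced frames). The rounding rule used to pick the frames and the function `toy_cnn`, which stands in for an image-trained CNN, are assumptions made for this example only.

```python
import numpy as np


def select_frames(video, max_frames=240, n_selected=26):
    """Keep the first max_frames frames (zero-padding shorter clips) and
    return n_selected equally spaced frames together with their indices."""
    video = video[:max_frames]
    if video.shape[0] < max_frames:
        pad_shape = (max_frames - video.shape[0],) + video.shape[1:]
        video = np.concatenate([video, np.zeros(pad_shape, dtype=video.dtype)], axis=0)
    # assumption: "equally spaced" is implemented by rounding a linear grid
    idx = np.linspace(0, max_frames - 1, n_selected).round().astype(int)
    return video[idx], idx


def encode_frames(frames, cnn):
    """Apply a per-frame feature extractor and stack the results into V."""
    return np.stack([cnn(f) for f in frames])


def toy_cnn(frame):
    # hypothetical stand-in for an image-trained CNN: mean colour of the frame
    return frame.mean(axis=(0, 1))


clip = np.random.rand(100, 8, 8, 3)   # 100 frames of 8 x 8 RGB
frames, idx = select_frames(clip)
V = encode_frames(frames, toy_cnn)    # one feature vector per selected frame
print(V.shape)
```

Replacing `toy_cnn` with a real network would give one fixed-size vector per selected frame, which is the form of V assumed in the rest of this section.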

2.3. Decoder: Long Short-Term Memory Network 

As discussed earlier, it is natural to use a recurrent neural network
(RNN) as a decoder when the output is a natural language sentence.
This has been empirically confirmed in the contexts of machine
translation [?], image caption generation [42, 44] and video
description generation in open [41] and closed [12] domains. Among
these recently successful applications of the RNN in natural language
generation, it is noticeable that most of them [?], if not all, used
long short-term memory (LSTM) units [?] or their variant, gated
recurrent units (GRU) [?]. In this paper, we also use a variant of the
LSTM units, introduced in [?], as the decoder.

The LSTM decoder maintains an internal memory state $c_t$ in addition
to the usual hidden state $h_t$ of an RNN (see Eq. (1)). The hidden
state $h_t$ is the memory state $c_t$ modulated by an output gate:

$$ h_t = o_t \odot c_t, $$

where $\odot$ is an element-wise multiplication. The output gate $o_t$
is computed by

$$ o_t = \sigma(W_o E[y_{t-1}] + U_o h_{t-1} + A_o \varphi_t(V) + b_o), $$

where $\sigma$ is the element-wise logistic sigmoid function and
$\varphi_t$ is a time-dependent transformation function on the encoder
features. $W_o$, $U_o$, $A_o$ and $b_o$ are, in order, the weight
matrices for the input, the previous hidden state, the context from
the encoder and the bias. $E$ is a word embedding matrix, and we
denote by $E[y_{t-1}]$ the embedding vector of word $y_{t-1}$.

The memory state $c_t$ is computed as a weighted sum of the previous
memory state $c_{t-1}$ and the new memory content update
$\tilde{c}_t$:

$$ c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, $$

where the coefficients, called forget and input gates respectively,
are given by

$$ f_t = \sigma(W_f E[y_{t-1}] + U_f h_{t-1} + A_f \varphi_t(V) + b_f), $$

$$ i_t = \sigma(W_i E[y_{t-1}] + U_i h_{t-1} + A_i \varphi_t(V) + b_i). $$

The updated memory content $\tilde{c}_t$ also depends on the current
input $y_{t-1}$, the previous hidden state $h_{t-1}$ and the features
from the encoder representation $\varphi_t(V)$:

$$ \tilde{c}_t = \tanh(W_c E[y_{t-1}] + U_c h_{t-1} + A_c \varphi_t(V) + b_c). $$

Once the new hidden state $h_t$ is computed, a probability
distribution over the set of possible words is obtained using a single
hidden layer neural network

$$ p_t = \mathrm{softmax}(U_p \tanh(W_p [h_t, \varphi_t(V), E[y_{t-1}]] + b_p) + d), \tag{2} $$

where $W_p$, $U_p$, $b_p$, $d$ are the parameters of this network and
$[\ldots]$ denotes vector concatenation. The softmax function allows
us to interpret $p_t$ as the probabilities of the distribution
$p(y_t \mid y_{<t}, V)$ over words.
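
The decoder update described above can be sketched in NumPy as follows. The dictionary keys mirror the parameter names in the text (W, U, A, b for each gate, plus the output-layer parameters); all dimensions, the random initialization and the helper names are arbitrary choices made for this illustration, and `ctx` stands for the transformed encoder features passed to the decoder at the current step.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x):
    z = np.exp(x - x.max())  # shift for numerical stability
    return z / z.sum()


def init_params(n_words, d_emb, d_hid, d_ctx, d_out, rng):
    """Random toy parameters, named after the symbols in the text."""
    p = {"E": rng.normal(0, 0.1, (n_words, d_emb))}
    for g in "ofic":  # output, forget and input gates, memory candidate
        p["W" + g] = rng.normal(0, 0.1, (d_hid, d_emb))
        p["U" + g] = rng.normal(0, 0.1, (d_hid, d_hid))
        p["A" + g] = rng.normal(0, 0.1, (d_hid, d_ctx))
        p["b" + g] = np.zeros(d_hid)
    p["Wp"] = rng.normal(0, 0.1, (d_out, d_hid + d_ctx + d_emb))
    p["bp"] = np.zeros(d_out)
    p["Up"] = rng.normal(0, 0.1, (n_words, d_out))
    p["d"] = np.zeros(n_words)
    return p


def lstm_step(p, h_prev, c_prev, y_prev, ctx):
    """One decoder step: returns the word distribution and the new states."""
    e = p["E"][y_prev]  # embedding of the previous word

    def pre(g):  # shared affine form used by all gates
        return p["W" + g] @ e + p["U" + g] @ h_prev + p["A" + g] @ ctx + p["b" + g]

    o, f, i = sigmoid(pre("o")), sigmoid(pre("f")), sigmoid(pre("i"))
    c_tilde = np.tanh(pre("c"))
    c = f * c_prev + i * c_tilde
    h = o * c  # as printed in the text: no extra tanh on the memory state
    hidden = np.tanh(p["Wp"] @ np.concatenate([h, ctx, e]) + p["bp"])
    p_t = softmax(p["Up"] @ hidden + p["d"])
    return p_t, h, c


rng = np.random.default_rng(0)
params = init_params(n_words=10, d_emb=4, d_hid=5, d_ctx=6, d_out=7, rng=rng)
p_t, h, c = lstm_step(params, np.zeros(5), np.zeros(5), y_prev=3, ctx=rng.normal(size=6))
print(p_t.sum())
```

The returned `p_t` is a proper distribution over the toy vocabulary, so it can be fed directly to a sampling or search procedure.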




At a higher level, the LSTM decoder can be written down as

$$ \begin{bmatrix} p(y_t \mid y_{<t}, V) \\ h_t \\ c_t \end{bmatrix} = \psi(h_{t-1}, c_{t-1}, y_{t-1}, V). \tag{3} $$


It is then trivial to generate a sentence from the LSTM decoder. For
instance, one can recursively evaluate $\psi$ and sample from the
returned $p(y_t \mid \ldots)$ until the sampled $y_t$ is the
end-of-sequence symbol. One can also approximately find the sentence
with the highest probability by using a simple beam search [33].
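
A generic beam search over such a decoder might look like the sketch below. It is not taken from the authors' implementation: the interface of `step`, the use of the end-of-sequence symbol as start symbol, the beam size and the toy probability table in `toy_step` are all assumptions made for illustration.

```python
import numpy as np


def beam_search(step, init_state, eos, beam_size=3, max_len=10):
    """Approximate the most probable sequence.

    step(state, prev_word) must return (log_probs, new_state), where
    log_probs is a 1-D array over the vocabulary.
    """
    beams = [(0.0, [], init_state)]  # (log-probability, words, decoder state)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, words, state in beams:
            prev = words[-1] if words else eos  # assumption: eos doubles as start symbol
            log_probs, new_state = step(state, prev)
            for w in np.argsort(log_probs)[-beam_size:]:
                candidates.append((score + log_probs[w], words + [int(w)], new_state))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:beam_size]:
            # hypotheses ending in eos are complete; the others are extended further
            (finished if cand[1][-1] == eos else beams).append(cand)
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was reached
    best = max(finished, key=lambda c: c[0])
    return best[1], best[0]


def toy_step(state, prev_word):
    """Toy decoder: the state is a step counter indexing a fixed table."""
    table = [[0.1, 0.6, 0.3], [0.2, 0.1, 0.7], [0.9, 0.05, 0.05]]
    return np.log(np.array(table[min(state, 2)])), state + 1


words, score = beam_search(toy_step, init_state=0, eos=0)
print(words, np.exp(score))
```

Because raw log-probabilities are compared, this variant favors shorter sentences; length normalization is a common remedy but is not discussed in the text.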

In [41], Venugopalan et al. used this type of LSTM decoder for
automatic video description generation. However, in their work the
feature transformation function $\varphi_t$ consisted in a simple
averaging, i.e.,

$$ \varphi_t(V) = \frac{1}{n} \sum_{i=1}^{n} v_i, \tag{4} $$

where the $v_i$'s are the elements of the set $V$ returned by the CNN
encoder from Sec. 2.2. This averaging effectively collapses all the
frames, regardless of their temporal relationships, leading to the
loss of temporal structure underlying the input video.
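
The loss of ordering caused by averaging is easy to verify numerically: the mean of the frame vectors is unchanged when the frames are permuted. The sketch below uses random vectors in place of real CNN features; the sizes (26 vectors of dimension 1024) are borrowed from the experimental section purely for concreteness.

```python
import numpy as np


def mean_pool(V):
    """Average the n frame vectors into a single context vector, as in Eq. (4)."""
    return V.mean(axis=0)


rng = np.random.default_rng(0)
V = rng.normal(size=(26, 1024))   # stand-in for 26 frame feature vectors
V_reversed = V[::-1]              # the same frames, played backwards

same = np.allclose(mean_pool(V), mean_pool(V_reversed))
print(same)
```

A clip and its time-reversed version therefore receive identical representations, which is exactly the limitation the temporal attention mechanism of Sec. 3.2 is meant to address.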


3. Exploiting Temporal Structure in Video Description Generation

In this section, we delve into the main contributions of this paper
and propose an approach for exploiting both the local and global
temporal structure in automatic video description.

3.1. Exploiting Local Structure: A Spatio-Temporal Convolutional Neural Net

We propose to model the local temporal structure of videos at the
level of the temporal features $V = \{v_1, \ldots, v_n\}$ that are
extracted by the encoder. Specifically, we propose to use a
spatio-temporal convolutional neural network (3-D CNN), which has
recently been demonstrated to capture well the temporal dynamics in
video clips [39, 19].

We use a 3-D CNN to build the higher-level representations that
preserve and summarize the local motion descriptors of short frame
sequences. This is done by first dividing the input video clip into a
3-D spatio-temporal grid of 16 × 12 × 2 (width × height × timesteps)
cuboids. Each cuboid is represented by concatenating the histograms of
oriented gradients, oriented flow and motion boundary (HoG, HoF, MbH)
[10, 43] with 33 bins. This transformation is done in order to make
sure that local temporal structure (motion features) is well extracted
and to reduce the computation of the subsequent 3-D CNN.





Figure 3. Illustration of the proposed temporal attention mechanism in
the LSTM decoder. [Figure stages: Features-Extraction, Soft-Attention,
Caption Generation]


Our 3-D CNN architecture is composed of three 3-D convolutional
layers, each followed by rectified linear activations (ReLU) and local
max-pooling. From the activation of the last 3-D
convolution+ReLU+pooling layer, which preserves the temporal
arrangement of the input video and abstracts the local motion
features, we can obtain a set of temporal feature vectors by
max-pooling along the spatial dimensions (width and height) to get
feature vectors that each summarize the content over short frame
sequences within the video. Finally, these feature vectors are
combined, by concatenation, with the image features extracted from
single frames taken at similar positions across the video. Fig. 2
illustrates the complete architecture of the described 3-D CNN.
Similarly to the object recognition trained CNN (see Sec. 2.2), the
3-D CNN is pre-trained on activity recognition datasets.
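
The spatial max-pooling and concatenation steps can be sketched as follows. The array layout (time, height, width, channels) and the 3 × 3 spatial size are assumptions made for this example; the feature sizes (26 temporal positions, 352 motion channels, 1024 appearance dimensions) are the ones reported in the experimental section.

```python
import numpy as np


def temporal_features(feature_map):
    """Max-pool a (time, height, width, channels) activation over height and width."""
    return feature_map.max(axis=(1, 2))


def combine(appearance, motion):
    """Concatenate per-frame appearance features with per-segment motion features."""
    return np.concatenate([appearance, motion], axis=1)


rng = np.random.default_rng(0)
fmap = rng.normal(size=(26, 3, 3, 352))   # hypothetical last-layer activation
frames = rng.normal(size=(26, 1024))      # stand-in for 2-D CNN frame features

motion = temporal_features(fmap)          # shape (26, 352)
V = combine(frames, motion)               # shape (26, 1376)
print(V.shape)
```

Pooling only over the spatial axes keeps one vector per temporal position, so the temporal arrangement of the clip is preserved in V.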

3.2. Exploiting Global Structure: A Temporal Attention Mechanism

The 3-D CNN features of the previous section allow us to better
represent short-duration actions in a subset of consecutive frames.
However, representing a complete video by averaging these local
temporal features as in Eq. (4) would jeopardize the model's ability
to exploit the video's global temporal structure.

Our approach to exploiting such non-local temporal structure is to let
the decoder selectively focus on only a small subset of frames at a
time. By considering subsets of frames in sequence, the model can
exploit the temporal ordering of objects and actions across the entire
video clip and avoid conflating temporally disparate events. Our
approach also has the potential of allowing the model to focus on key
elements of the video that may have short duration. Methods that
collapse the temporal structure risk overwhelming these short-duration
elements.

Specifically, we propose to adapt the recently proposed soft attention
mechanism from [?], which allows the decoder to weight each temporal
feature vector in $V = \{v_1, \ldots, v_n\}$. This approach has been
used successfully by Xu et al. [44] for exploiting spatial structure
underlying an image. Here, we thus adapt it to exploit the temporal
structure of video instead.

Figure 2. Illustration of the spatio-temporal convolutional neural
network (3-D CNN). This network is trained for activity recognition.
Then, only the convolutional layers are involved when generating video
descriptions. [Figure panels: A: Low-level Video Representation
(T=240, 15 × 15 × 120 crops); B: 3D Convolutional Networks]

Instead of a simple averaging strategy (as shown in Eq. (4)), we take
the dynamic weighted sum of the temporal feature vectors such that

$$ \varphi_t(V) = \sum_{i=1}^{n} \alpha_i^{(t)} v_i, $$

where $\sum_{i=1}^{n} \alpha_i^{(t)} = 1$ and the $\alpha_i^{(t)}$'s
are computed at each time step $t$ inside the LSTM decoder (see
Sec. 2.3). We refer to $\alpha_i^{(t)}$ as the attention weights at
time $t$.

The attention weight $\alpha_i^{(t)}$ reflects the relevance of the
$i$-th temporal feature in the input video given all the previously
generated words, i.e., $y_1, \ldots, y_{t-1}$. Hence, we design a
function that takes as input the previous hidden state $h_{t-1}$ of
the LSTM decoder, which summarizes all the previously generated words,
and the feature vector of the $i$-th temporal feature and returns the
unnormalized relevance score $e_i^{(t)}$:

$$ e_i^{(t)} = w^\top \tanh(W_a h_{t-1} + U_a v_i + b_a), $$

where $w$, $W_a$, $U_a$ and $b_a$ are the parameters that are
estimated together with all the other parameters of the encoder and
decoder networks.

Once the relevance scores $e_i^{(t)}$ for all the frames
$i = 1, \ldots, n$ are computed, we normalize them to obtain the
$\alpha_i^{(t)}$'s:

$$ \alpha_i^{(t)} = \exp\{e_i^{(t)}\} \Big/ \sum_{j=1}^{n} \exp\{e_j^{(t)}\}. $$

We refer to this whole process of computing the unnormalized relevance
scores and normalizing them to obtain the attention weights as the
attention mechanism.

The attention mechanism allows the decoder to selectively focus on
only a subset of frames by increasing the attention weights of the
corresponding temporal features. However, we do not explicitly force
this type of selective attention to happen. Rather, the inclusion of
the attention mechanism enables the decoder to exploit the temporal
structure, if there is useful temporal structure in the data. Later in
Sec. 5, we empirically show that this is indeed the case. See Fig. 3
for the graphical illustration of the temporal attention mechanism.
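
One step of the temporal attention mechanism can be sketched in NumPy as below: a relevance score is computed for every temporal feature from the previous decoder state, the scores are normalized with a softmax, and the context passed to the decoder is the resulting weighted sum. The dimensions of the hidden state and of the scoring layer, the random parameter values and the function name are assumptions made for illustration.

```python
import numpy as np


def temporal_attention(h_prev, V, w, W_a, U_a, b_a):
    """Return the attention weights (one per temporal feature) and the context vector."""
    # unnormalized relevance scores, one per row of V
    scores = np.tanh(h_prev @ W_a.T + V @ U_a.T + b_a) @ w
    scores = scores - scores.max()          # shift for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    context = alpha @ V                     # weighted sum of the feature vectors
    return alpha, context


rng = np.random.default_rng(0)
n, d_v, d_h, d_a = 26, 1376, 8, 16          # d_h and d_a are arbitrary toy sizes
V = rng.normal(size=(n, d_v))
h_prev = rng.normal(size=d_h)
w = rng.normal(size=d_a)
W_a = rng.normal(size=(d_a, d_h))
U_a = rng.normal(scale=0.01, size=(d_a, d_v))
b_a = np.zeros(d_a)

alpha, context = temporal_attention(h_prev, V, w, W_a, U_a, b_a)
print(alpha.sum(), context.shape)
```

Since the weights depend on the previous hidden state, they are recomputed for every generated word, which is what lets the decoder attend to different parts of the clip over time.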

4. Related Work 

Video description generation has been investigated and studied in
other work, such as [?]. Most of these examples have, however,
constrained the domain of videos as well as the activities and objects
embedded in the video clips. Furthermore, they tend to rely on
hand-crafted visual representations of the video, to which
template-based or shallow statistical machine translation approaches
were applied. In contrast, the approach we take and propose in this
paper aims at open-domain video description generation with deep
trainable models starting from low-level video representations,
including raw pixel intensities (see Sec. 2.2) and local motion
features (see Sec. 3.1).

In this sense, the approach we use here is more closely related to the
recently introduced static image caption generation approaches based
mainly on neural networks [20, ?]. A neural approach to static image
caption generation has recently been applied to video description
generation by Venugopalan et al. [41]. However, their direct
adaptation of the underlying static image caption generation mechanism
to videos is limited by the fact that the model tends to ignore the
temporal structure of the underlying video. Such structure has been
demonstrated to be helpful in the context of event and action
classification [?], and is explored in this paper. Other recent work
[?] has explored the use of DVS annotated video for video description
research and has underscored the observation that DVS descriptions are
typically much more relevant and accurate descriptions of the visual
content of a video compared to movie scripts. They present results
using both DVS and script based annotations as well as cooking
activities.

While other work has explored 3-D deep networks for video
[36, 16, ?, 30], our particular approach differs in a number of ways
from prior work in that it is based on CNNs as opposed to other 3-D
deep architectures and we focus on pre-training the model on a number
of widely used action recognition datasets. In contrast to other 3-D
CNN formulations, the input to our 3-D CNN consists of features
derived from a number of state-of-the-art image descriptors. Our model
is also fully 3-D in that we model entire volumes across a video clip.
In this paper, we use a state-of-the-art static convolutional neural
network (CNN) and a novel spatio-temporal 3-D CNN to model input video
clips. This way of modeling video using feedforward convolutional
neural networks has become increasingly popular recently [?]. However,
there has also been a stream of research on using recurrent neural
networks (RNNs) for modeling video clips. For instance, in [?],
Srivastava et al. propose to use long short-term memory units to
extract video features. Ranzato et al. [?] also model a video clip
with an RNN, but only after vector-quantizing image patches of the
video clip. In contrast to other approaches such as [?], which have
explored CNN-RNN coupled models for video description, here we use an
attention mechanism, use a 3-D CNN and focus on open-domain video
description.

5. Experiments 

We test the proposed approaches on two video-description corpora:
Youtube2Text [7] and DVS [38]. Implementations are available at
https://github.com/yaoli/arctic-capgen-vid.


5.1. Datasets

Youtube2Text The Youtube2Text video corpus [7] is well suited for
training and evaluating an automatic video description generation
model. The dataset has 1,970 video clips with multiple natural
language descriptions for each video clip. In total, the dataset
consists of approximately 80,000 video / description pairs, with a
vocabulary of approximately 16,000 unique words. The dataset is
open-domain and covers a wide range of topics including sports,
animals and music. Following [41], we split the dataset into a
training set of 1,200 video clips, a validation set of 100 clips and a
test set consisting of the remaining clips.

DVS The DVS dataset was recently introduced in [38] with a much larger
number of video clips and accompanying descriptions than the existing
video/description corpora such as Youtube2Text. It contains video
clips extracted from 92 DVD movies along with semi-automatically
transcribed descriptive video service (DVS) narrations. The dataset
consists of 49,000 video clips covering a wide variety of situations.
We follow the standard split of the dataset into a training set of
39,000 clips, a validation set of 5,000 clips and a test set of 5,000
clips, as suggested by [38].

Description Preprocessing We preprocess the descriptions in both the
Youtube2Text and DVS datasets with wordpunct-tokenizer from the NLTK
toolbox.^2 We did not do any other preprocessing such as lowercasing
and rare word elimination. After preprocessing, the numbers of unique
words were 15,903 for Youtube2Text and 17,609 for the DVS dataset.

Video Preprocessing To reduce the computational and memory
requirement, we only consider the first 240 frames of each video.^3
For appearance features, the (trained) 2-D GoogLeNet CNN is used to
extract a fixed-length representation (with the help of the popular
implementation in Caffe [?]). Features are extracted from the
pool5/7x7_s1 layer. We select 26 equally-spaced frames out of the
first 240 from each video and feed them into the CNN to obtain a
1024-dimensional frame-wise feature vector. We also apply the
spatio-temporal 3-D CNN (trained as described in Sec. 5.2) in order to
extract local motion information.^4 When using the 3-D CNN without
temporal attention, we simply use the 2500-dimensional activation of
the last fully-connected layer. When we combine the 3-D CNN with the
temporal attention mechanism, we leverage the last convolutional layer
representation, leading to 26 feature vectors of size 352. Those
vectors are concatenated with the 2-D CNN features, resulting in 26
feature vectors with 1376 elements.

5.2. Experimental Setup


Models We test four different model variations for video description
generation based on the underlying encoder-decoder framework, with
results presented in Table 1. Enc-Dec (Basic) denotes a baseline
incorporating neither local nor global temporal structure. It is based
on an encoder using the 2-D GoogLeNet CNN [34] as discussed in
Section 2.2 and the LSTM-based decoder outlined in Section 2.3.
Enc-Dec + Local incorporates local temporal structure via the
integration of our proposed 3-D CNN features (as outlined in
Section 3.1) with the 2-D GoogLeNet CNN features as described above.
Enc-Dec + Global adds the temporal attention mechanism of Section 3.2.
Finally, Enc-Dec + Local + Global incorporates both the 3-D CNN and
the temporal attention mechanism into the model. All models otherwise
use the same number of temporal features $v_i$. These experiments will
allow us to investigate whether the contributions from the proposed
approaches are complementary and can be combined to further improve
performance.


^2 http://www.nltk.org/index.html

^3 When the video clip has fewer than 240 frames, we pad the video
with all-zero frames to make it 240 frames long.

^4 We perturb each video along three axes to form random crops by
taking multiple 15 × 15 × 120 cuboids out of the original
20 × 20 × 120 cuboids, and the final representation is the average of
the representations from these perturbed video clips.





Table 1. Performance of different variants of the model on the Youtube2Text and DVS datasets.

                                 |            Youtube2Text             |                DVS
Model                            | BLEU    METEOR  CIDEr   Perplexity  | BLEU   METEOR  CIDEr  Perplexity
---------------------------------+-------------------------------------+----------------------------------
Enc-Dec (Basic)                  | 0.3869  0.2868  0.4478  33.09       | 0.003  0.044   0.044  88.28
+ Local (3-D CNN)                | 0.3875  0.2832  0.5087  33.42       | 0.004  0.051   0.050  84.41
+ Global (Temporal Attention)    | 0.4028  0.2900  0.4801  27.89       | 0.003  0.040   0.047  66.63
+ Local + Global                 | 0.4192  0.2960  0.5167  27.55       | 0.007  0.057   0.061  65.44
---------------------------------+-------------------------------------+----------------------------------
Venugopalan et al. [41]          | 0.3119  0.2687  -       -           | -      -       -      -
+ Extra Data (Flickr30k, COCO)   | 0.3329  0.2907  -       -           | -      -       -      -
Thomason et al. [37]             | 0.1368  0.2390  -       -           | -      -       -      -


Training For all video description generation models, we estimated the
parameters by maximizing the log-likelihood:

$$ \sum_{n=1}^{N} \sum_{i=1}^{t_n} \log p(y_i^n \mid y_{<i}^n, x^n), $$

where there are $N$ training video-description pairs $(x^n, y^n)$, and
each description $y^n$ is $t_n$ words long.
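
As a small worked example of this objective, the sketch below sums the log-probabilities that a model assigns to the reference words of each training description. The input format (one array of per-word probabilities per description) and the numbers are made up for illustration; in practice these probabilities would come from the decoder's softmax output.

```python
import numpy as np


def log_likelihood(word_probs):
    """Sum of log-probabilities over all words of all descriptions.

    word_probs[n][i] is the probability the model assigned to the i-th
    reference word of the n-th description.
    """
    return sum(float(np.sum(np.log(p))) for p in word_probs)


# two toy descriptions, of length 2 and 3
probs = [np.array([0.5, 0.25]), np.array([0.5, 0.5, 0.5])]
ll = log_likelihood(probs)
print(ll)
```

Here the product of all word probabilities is 2^-6, so the log-likelihood equals -6 log 2; training increases this quantity by raising the probability of each reference word.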

We used Adadelta [46] with the gradient computed by the
backpropagation algorithm. We optimized the hyperparameters (e.g.
number of LSTM units and the word embedding dimensionality) using
random search to maximize the log-probability of the validation
set.^5 Training continued until the validation log-probability stopped
increasing for 5,000 updates. As mentioned earlier in Sec. 3.1, the
3-D CNN was trained on activity recognition datasets. Due to space
limitation, details regarding the training and evaluation of the 3-D
CNN on activity recognition datasets are provided in the Supplementary
Material.
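
The stopping rule can be sketched as a simple patience counter. The callables `update` and `validate` are hypothetical placeholders for one optimizer update and one evaluation of the validation log-probability; evaluating after every update is an assumption made here for simplicity, since the text does not state how often validation was run.

```python
def train_with_early_stopping(update, validate, patience=5000, max_updates=10**6):
    """Run updates until the validation log-probability has not improved
    for `patience` consecutive updates; return (best value, best update)."""
    best, best_step = float("-inf"), 0
    for step in range(1, max_updates + 1):
        update()
        score = validate()
        if score > best:
            best, best_step = score, step
        elif step - best_step >= patience:
            break
    return best, best_step


# toy run: the validation score improves for three updates, then stalls
scores = iter([-5.0, -4.0, -3.0] + [-3.5] * 100)
best, step = train_with_early_stopping(lambda: None, lambda: next(scores), patience=10)
print(best, step)
```

In an actual training run one would also save the parameters at `best_step` so that the best model on the validation set can be restored afterwards.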

Evaluation We report the performance of our proposed method using test
set perplexity and three model-free automatic evaluation metrics.
These are BLEU [?], METEOR [?] and CIDEr [40]. We use the evaluation
script prepared and introduced in [?].

5.3. Quantitative Analysis 

In the first block of Table 1, we present the performance of the four
different variants of the model using all four metrics: BLEU, METEOR,
CIDEr and perplexity. Subsequent lines in the table give comparisons
with prior work. The first three rows (Enc-Dec (Basic), +Local and
+Global) show that it is generally beneficial to exploit some type of
temporal structure underlying the video. Although this benefit is most
evident with perplexity (especially with the temporal attention
mechanism exploiting global temporal structure), we observe a similar
trend with the other model-free metrics and across both Youtube2Text
and DVS datasets.

We observe, however, that the biggest gain can be achieved by letting
the model exploit both local and global temporal structure (the fourth
row in Table 1). We observed this gain consistently across both
datasets as well as using all four automatic evaluation metrics.

^5 Refer to the Supplementary Material for the selected
hyperparameters.

5.4. Qualitative Analysis 

Although the model-free evaluation metrics such as the 
ones we used in this paper (BLEU, METEOR, CIDEr) were 
designed to refiect the agreement level between reference 
and generated descriptions, it is not intuitively clear how 
well those numbers (see Table refiect the quality of the 
actual generated descriptions. Therefore, we present some 
of the video clips and their corresponding descriptions, both 
generated and reference, from the test set of each dataset. 
Unless otherwise labeled, the visualizations in this section 
are from the best model which exploits both global and local 
temporal structure (the fourth row of Table [T]). 

In Fig. 4, two video clips from the test set of
Youtube2Text are shown. We can clearly see that the
generated descriptions correspond well with the video clips. In
Fig. 4 we also show two sample video clips from the DVS
dataset. Clearly, the model does not perform as well on the
DVS dataset as it did on Youtube2Text, which was already
evident from the quantitative analysis in Sec. 5.3. However,
we still observe that the model often focuses correctly on
a subset of frames according to the word to be generated.
For instance, in the left panel, when the model is about to
generate the second “SOMEONE”, it focuses mostly on the
first frame. Also, in the right panel, the model correctly
attends to the second frame when the word “types” is about to
be generated. As for the 3-D CNN local temporal features,
we see that they allowed the model to correctly identify the
action as “frying”, as opposed to simply “cooking”.

More samples of the video clips and the generated/reference
descriptions can be found in the Supplementary Material,
including visualizations from the global temporal
attention model alone (see the third row in Table 1).

+Local+Global: A man and a woman are talking on the
Ref: A man and a woman ride a motorcycle

+Local+Global: Someone is frying a fish in a
+Local: Someone is frying something
+Global: The person is cooking
Basic: A man cooking its kitchen
Ref: A woman is frying food

+Local+Global: the girl grins at him
Ref: SOMEONE and SOMEONE swap a look

+Local+Global: as SOMEONE sits on the table, shifts his gaze to SOMEONE
+Local: with a smile SOMEONE arrives
+Global: SOMEONE sits at a table
Basic: now, SOMEONE grins
Ref: SOMEONE gaze at SOMEONE

Figure 4. Four sample videos and their corresponding generated and ground-truth descriptions from Youtube2Text (Left Column) and
DVS (Right Column). The bar plot under each frame corresponds to the attention weight α for the frame when the corresponding word
(color-coded) was generated. From the top left panel, we can see that when the word “road” is about to be generated, the model focuses
strongly on the third frame, where the road is clearly visible. Similarly, in the bottom left panel, we can see that the model attends to the
second frame when it is about to generate the word “Someone”. The bottom row includes alternate descriptions generated by the other
model variations.

6. Conclusion

In this work, we address the challenging problem of
producing natural language descriptions of videos. We
identify and underscore the importance of capturing both
local and global temporal structure in addition to frame-wise
appearance information. To this end, we propose a novel
3-D convolutional neural network that is designed to
capture local fine-grained motion information from consecutive
frames. In order to capture global temporal structure, we
propose the use of a temporal attention mechanism that
learns to focus on subsets of frames. Finally,
the two proposed approaches fit naturally together into an
encoder-decoder neural video caption generator.

We have empirically validated each approach on both
the Youtube2Text and DVS datasets using four standard
evaluation metrics. Experiments indicate that models using
either approach improve over the baseline model.
Furthermore, combining the two approaches gives the best
performance. In fact, we achieved state-of-the-art results on
Youtube2Text with the combination.

Given the challenging nature of the task, we hypothesize
that the performance on the DVS dataset could be
significantly improved by incorporating another recently proposed
dataset [27] similar to the DVS data used here. In addition,
we have some preliminary experimental results that indicate
that further performance gains are possible by leveraging
image caption generation datasets such as MS COCO
and Flickr. We intend to explore this direction more fully
in future work.

7. Details of experiments
7.1. 3-D CNN 


[Figure 5 panel labels: A: Low-level Video Representation; 15x15x120 crops; B: 3D Convolutional Networks; HOG, HOF, MBH]

Figure 5. Illustration of the spatio-temporal convolutional neural network (3-D CNN). This network is trained for activity
recognition. Then, only the convolutional layers are involved when generating video descriptions.


The 3-D CNN architecture is specified in Figure 5. Our model is composed of three 3-D convolutional layers, using
3x3x3 kernels. The number of output features after each convolution is given in Figure 5. Each convolutional layer
is followed by a rectified linear activation (ReLU) and local max-pooling. After the convolutions, a fully-connected layer
(dimension 2500, with ReLU activation) is applied, followed by a Softmax layer. A dropout of 0.5 is applied on those last
two layers.
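
To make the layer stacking concrete, the following is a minimal shape-bookkeeping sketch for three stages of 3x3x3 convolution followed by max-pooling, applied to a 15x15x120 input cuboid (the crop size quoted in this section). The padding of 1 and the pooling window of 2 are assumptions made only for this illustration, since the text does not state them, and the function names are hypothetical.

```python
# Shape bookkeeping only. The padding and the pooling window are NOT given
# in the text; they are hypothetical choices made for this illustration.

def conv_output_size(size, kernel=3, padding=1, stride=1):
    """Output length of one axis after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_output_size(size, window=2):
    """Output length of one axis after non-overlapping max-pooling."""
    return size // window


def trace_shapes(input_shape, num_layers=3, kernel=3, padding=1, window=2):
    """Return the cuboid shape after each conv + ReLU + max-pool stage."""
    shapes = []
    shape = tuple(input_shape)
    for _ in range(num_layers):
        shape = tuple(conv_output_size(s, kernel, padding) for s in shape)
        shape = tuple(pool_output_size(s, window) for s in shape)
        shapes.append(shape)
    return shapes


stage_shapes = trace_shapes((15, 15, 120))
for layer, shape in enumerate(stage_shapes, start=1):
    print(f"after stage {layer}: {shape}")
```

Under these assumed settings the temporal axis shrinks by half at every stage while the spatial axes collapse quickly; other padding or pooling choices would give different shapes.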

Multitask learning is used to train the model on three human activity recognition datasets: UCF101, with 13320
Youtube videos and 101 human activity classes; HMDB51 [23], with 3700 videos and 51 human activity
classes; and a random subset of the Sports-1M dataset, using 50,000 videos that have 487 sports labels.

We trained the 3-D CNN using stochastic gradient descent with a momentum of 0.7. The learning rate is initially set to 0.1
and is then decreased following the schedule 0.05, 0.02, 0.01, each time the validation cost stagnates. At each iteration a
minibatch of size 48 is constructed by sampling uniformly from all 3 human activity datasets. We perturb each video along three
axes to encourage the model to learn invariant feature representations: we take random crops of size 15x15x120 out
of the original 20x20x120 cuboids. Videos are also randomly flipped.
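
The learning-rate schedule and the crop perturbation described above can be sketched as follows. This is an illustration under stated assumptions, not the training code used for the experiments: the `patience` criterion for deciding that the validation cost has stagnated, the flip probability of 0.5, and all class and function names are hypothetical.

```python
import random

LEARNING_RATES = [0.1, 0.05, 0.02, 0.01]  # values quoted in the text


class PlateauSchedule:
    """Step to the next learning rate when the validation cost stops improving.

    `patience` (evaluations without improvement before stepping) is a
    hypothetical parameter; the text only says the cost "stagnates".
    """

    def __init__(self, rates=LEARNING_RATES, patience=2):
        self.rates = list(rates)
        self.patience = patience
        self.index = 0
        self.best_cost = float("inf")
        self.bad_epochs = 0

    @property
    def learning_rate(self):
        return self.rates[self.index]

    def update(self, validation_cost):
        """Record one validation cost and return the learning rate to use next."""
        if validation_cost < self.best_cost:
            self.best_cost = validation_cost
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            can_step = self.index < len(self.rates) - 1
            if self.bad_epochs >= self.patience and can_step:
                self.index += 1
                self.bad_epochs = 0
        return self.learning_rate


def sample_crop(rng, full=(20, 20, 120), crop=(15, 15, 120)):
    """Pick a random crop origin per axis and a random flip decision."""
    origin = tuple(rng.randint(0, f - c) for f, c in zip(full, crop))
    flip = rng.random() < 0.5  # the flip probability is an assumption
    return origin, flip


schedule = PlateauSchedule()
costs = [1.0, 0.8, 0.8, 0.81, 0.7, 0.7, 0.7]
rates = [schedule.update(c) for c in costs]
print(rates)
print(sample_crop(random.Random(0)))
```

Because the crop and the original cuboid share the same length of 120 along the last axis, the crop origin on that axis is always 0 in this sketch.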

Although our interest is in video description, we validate that our model obtains reasonable performance on the activity
recognition task using HMDB51 (split 1) and UCF101. On HMDB51 (split 1) our model achieves an accuracy of 52.3%, which
is 3% lower than the best motion-based single model (the temporal-based CNN of [30]). On UCF-101, our 3-D CNN
obtains an accuracy of 76.49%. While the temporal-based CNN [30] achieves 83.7%, our model outperforms other single 3-D
convolution based approaches such as C3D (72.29%) and slow-fusion convnet (65.4%) [19].

7.2. Encoder-Decoder Model Training 

Hyperparameters reflect the learning capacity of the models. We made sure that the hyperparameters of each type of model
were sufficiently explored. Model selection on both Youtube2Text and DVS is performed by random search.
Four types of models are trained:

• Basic Enc-Dec 

• Basic Enc-Dec + Local (3-D ConvNet) 

• Basic + Global (Temporal Attention) 

• Basic + Local + Global 

For each of the four types of models, we performed 50 experiments with random search on the critical hyperparameters. Each
of the 50 experiments is associated with a specific hyperparameter setup, and the 50 setups are shared across all four types of
models. The critical hyperparameters explored are:

• the dimensionality of the word embedding, in the range of [100, 1000]

• the dimensionality of the LSTM hidden/memory states, in the range of [100, 3000]

• dropout, either used or not used, decided at random.

The same procedure is used on both Youtube2Text and DVS datasets. 
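
The search procedure above can be sketched as follows. This is a minimal illustration rather than the code used for the experiments: uniform integer sampling, the seed, and the `train_and_validate` callback (assumed to return a validation score where higher is better) are all hypothetical.

```python
import random

MODEL_TYPES = [
    "Basic Enc-Dec",
    "Basic Enc-Dec + Local (3-D ConvNet)",
    "Basic + Global (Temporal Attention)",
    "Basic + Local + Global",
]


def sample_setups(num_setups=50, seed=0):
    """Draw hyperparameter setups from the ranges listed above.

    Uniform integer sampling and the seed are assumptions; the text
    gives only the ranges.
    """
    rng = random.Random(seed)
    return [
        {
            "emb": rng.randint(100, 1000),    # word embedding dimensionality
            "lstm": rng.randint(100, 3000),   # LSTM hidden/memory dimensionality
            "dropout": rng.random() < 0.5,    # use dropout or not
        }
        for _ in range(num_setups)
    ]


def random_search(train_and_validate, setups):
    """Evaluate every model type on the same shared list of setups.

    `train_and_validate(model_type, setup)` is a hypothetical callback
    returning a validation score where higher is better.
    """
    best = {}
    for model_type in MODEL_TYPES:
        scored = [
            (train_and_validate(model_type, setup), index)
            for index, setup in enumerate(setups)
        ]
        _, best_index = max(scored)
        best[model_type] = setups[best_index]
    return best


setups = sample_setups()
```

Sampling the setups once and reusing them for every model type mirrors the statement that the 50 setups are shared across all four types of models.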


Table 2. Hyperparameters of best models on Youtube2Text.

model                                  emb   lstm   dropout
Basic Enc-Dec                          211   1096   True
Basic Enc-Dec + Local (3-D ConvNet)    161   1292   True
Basic + Global (Temporal Attention)    476   2231   True
Basic + Local + Global                 454   1714   True


Table 3. Hyperparameters of best models on DVS.

model                                  emb   lstm   dropout
Basic Enc-Dec                          345   1014   True
Basic Enc-Dec + Local (3-D ConvNet)    512   2560   True
Basic + Global (Temporal Attention)    656   1635   True
Basic + Local + Global                 454   1714   True


8. Inspecting the learned soft-attention coefficients α

We illustrate the caption generation process of the proposed soft-attention models, comparing Basic+Global with
Basic+Local+Global, with a dynamic α over frames for each word in the generated caption.

The bar chart shows the magnitude of α. The generated caption is shown on the left. Each generated word corresponds
to an α vector, shown in the same row. Each bar corresponds to a particular frame at the very top of the figure, organized
sequentially. Within the same row, the height of the bar shows the importance of its corresponding frame in generating that
word. 20 frames are shown for better visibility.
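
The row-wise normalization of the coefficients can be illustrated with a short sketch: for each generated word, a softmax turns per-frame relevance scores into weights that sum to 1. The scores and feature values below are made up, the function names are hypothetical, and forming a weighted sum of frame features is assumed here as the usual way soft-attention weights are applied.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of relevance scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(frame_features, scores):
    """Weight per-frame feature vectors by the attention weights and sum them."""
    alpha = softmax(scores)
    dim = len(frame_features[0])
    context = [
        sum(a * frame[d] for a, frame in zip(alpha, frame_features))
        for d in range(dim)
    ]
    return alpha, context


# Three frames with 2-D features and made-up, equal relevance scores for one word.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha, context = attend(features, scores=[0.0, 0.0, 0.0])
print(alpha, context)
```

Each row of bars in the figures corresponds to one such weight vector, computed anew for every generated word.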

8.1. Caption generation and α visualization on the Youtube2Text test set

Figure 6. Model type: Basic + Global. The model shifts its attention across frames to generate a caption. The bar chart shows the
magnitude of α, which sums to 1 row-wise; the higher the bar, the bigger the magnitude.

Figure 7. Model type: Basic + Local + Global. The model shifts its attention across frames when generating the caption. The bar chart
shows the magnitude of α, which sums to 1 row-wise; the higher the bar, the bigger the magnitude. It does a better job at guessing the
object being chopped, compared with Figure 6.

Figure 8. Model type: Basic + Global. The model shifts its attention across frames to generate a caption. The bar chart shows the
magnitude of α, which sums to 1 row-wise; the higher the bar, the bigger the magnitude.

Figure 9. Model type: Basic + Local + Global. The model shifts its attention across frames when generating the caption. The bar chart
shows the magnitude of α, which sums to 1 row-wise; the higher the bar, the bigger the magnitude. The use of additional motion features
offers a more faithful description of the action than the one without (“running” vs. “walking” in Figure 8).

Figure 10. Model type: Basic + Global. The model shifts its attention across frames to generate a caption. The bar chart shows the
magnitude of α, which sums to 1 row-wise; the higher the bar, the bigger the magnitude.

Figure 11. Model type: Basic + Local + Global. The model shifts its attention across frames to generate a caption. The bar chart shows
the magnitude of α, which sums to 1 row-wise; the higher the bar, the bigger the magnitude. The model generates a more faithful
description with much richer content than Figure 10. It even learns to generate a rare word, “teasing”.







8.2. Caption generation and α visualization on the DVS test set

This section illustrates the same process on DVS, a much more challenging dataset. See the following figures for a detailed
explanation of soft attention applied to videos with different properties.


[Figure panel text: SOMEONE nods]

Figure 12. Model type: Basic + Global. The model tends to produce a smooth distribution of α within each row, due to the uniformity
of the scene with a slowly changing continuous shot.








Figure 13. Model type: Basic + Local + Global. The model learns a smooth α on the slowly changing scene. It captures a different action
from Basic + Global in Figure 12.

Figure 14. Model type: Basic + Global. α also reflects the sudden transition between two shots.

Figure 15. Model type: Basic + Local + Global. The learned model generates a more sophisticated description than Figure 14, attempting
to incorporate character-level interaction inside the first part of the scene.

Figure 16. Model type: Basic + Global. The model seems to focus on the second shot of the scene at the beginning, yet the part of the
generated caption “out of the car” distributes a decent amount of its attention on the first scene as well. This may be due to the fact
that the memory of the decoding LSTM already contains the information of almost the entire scene (two shots).

Figure 17. Model type: Basic + Local + Global. The learned model generates a more sophisticated description than Figure 16. The model
focuses on the car in the second shot when generating “sit” and “back seat”. When generating the two “SOMEONE” tokens, it divides its
attention between the two shots.







Figure 18. Model type: Basic + Global. The description is arguably not very accurate.

Figure 19. Model type: Basic + Local + Global. With the help of additional features, the model successfully describes the cell phone and
the room, a much more faithful description than Figure 18.



